Abstract

Background: Epstein-Barr virus (EBV) is strongly associated with human
malignancies such as gastric carcinoma. Studying synonymous codon usage in
viruses could help in designing vaccines that generate immunity.

The Codon Adaptation Index (CAI) is a measure of translation elongation rate
derived from highly expressed genes. The aim of this study was to use the CAI
to measure translation efficiency and so estimate how fast EBV-GD1 can produce
its proteins.

Methods: The complete genomic sequence of human herpesvirus 4 strain GD1
(GenBank accession no. AY961628) was retrieved from
http://www.ncbi.nlm.nih.gov/sites/gquery, and all protein-coding genes were
extracted. The sequences were analyzed with DAMBE software.

Results: The CAI values of the EBV-GD1 genes were 0.76356 ± 0.02957. The
highest and lowest CAI values were 0.82233 and 0.68321, respectively. Highly
expressed genes mostly showed more codon usage bias than lowly expressed
genes.

Conclusion: The results provide a framework and principles for understanding
the pathogenesis and evolution of EBV-GD1, and open a window toward making a
better product or vaccine against the virus.
Introduction

The Codon Adaptation Index (CAI) measures synonymous codon usage bias in a
DNA or RNA sequence. CAI is known to be an excellent predictor of gene
expression in prokaryotes and unicellular eukaryotes. It has been used to
evaluate the effect of natural selection on patterns of codon usage and to
predict gene expression levels [1, 2], to find highly expressed genes [3, 4],
to evaluate the adaptation of viral genes to their hosts [1], to assess
heterologous gene expression [5], to compare codon usage preferences among
organisms [1], to find horizontally transferred genes [6-8], to detect codon
usage bias in genomes [9], in cell cycle studies [10], to optimize DNA
vaccines [11], and in gene therapy [12], vaccine development and recombinant
therapeutics [13]. Some studies have reported an influence of codon usage on
the viral cycle. Studies of adaptation to host codon usage have indicated
that viral genes coding for critical proteins tend to use the synonymous
codons that are most represented in the host genome [14], although synonymous
codons are not used equally within and between genomes [15].

Epstein-Barr virus (EBV) is a ubiquitous double-stranded DNA virus of the
human herpesvirus family with B-lymphotropism. More than 90% of adults have
serologic evidence of infection with this virus. It is acquired during early
childhood, and the age at infection is much lower in underdeveloped countries
with poor socioeconomic conditions [16]. It has been documented that gastric
carcinoma, Burkitt's lymphoma, undifferentiated nasopharyngeal carcinoma
(NPC), Hodgkin's disease, B- and T-cell lymphomas, and B-cell
lymphoproliferations in immunocompromised patients can be caused by EBV
[17-20]. Iran has a high incidence of gastric carcinoma, with an annual
incidence of 26.1 per 100,000 for males and 11.1 per 100,000 for females [21].

In biopharmacology, researchers are interested in improving the translation
efficiency on which protein production depends. Unfortunately, experiments
are tedious and the reality is much more complicated. In the current study,
DAMBE software (version 5.3.27) was used to assess CAI in order to estimate
how fast EBV-GD1 can produce its proteins. These data might provide a
framework and principles for understanding the pathogenesis and evolution of
EBV-GD1.
Materials and Methods

The study started in the winter of 2012. All bioinformatics analyses were
performed at the bioinformatics facility of the Faculty of Science,
University of Zabol. The genome sequence of human herpesvirus 4 strain GD1
(GenBank accession no. AY961628) was retrieved from
http://www.ncbi.nlm.nih.gov/sites/gquery, and all protein-coding genes were
extracted in order to evaluate CAI with DAMBE [22]. The CAI of any
protein-coding sequence is calculated as:

(1)

$$\mathrm{CAI} = \left( \prod_{k=1}^{n} w_k \right)^{1/n}$$

where n is the number of sense codons in the sequence and w_k is the w_ij
value of the k-th codon; codons whose w_ij value is always 1 regardless of
the codon usage bias of the gene are excluded. The CAI of a coding sequence
(CDS) is calculated from 1) the codon frequencies of the CDS and 2) the codon
frequencies of a set of known highly expressed genes (often referred to as
the reference set), which is used to generate a column of w values:



(2)

$$w_{ij} = \frac{f_{ij.\mathrm{ref}}}{\mathrm{Max}f_{i.\mathrm{ref}}}$$

where fij.ref is the frequency of codon j in synonymous codon family i in the
reference set, and Maxfi.ref is the maximum codon frequency in synonymous
codon family i. The codon whose frequency is Maxfi.ref is often referred to
as the major codon (its w is 1), and the other codons are referred to as
minor codons. The major codon is assumed to be the translationally optimal
codon.

The CAI value of a CDS is calculated with the following equation:

(3)

$$\mathrm{CAI} = \exp\left( \frac{\sum_{i=1}^{m} \sum_{j=1}^{n_i} f_{ij} \ln w_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n_i} f_{ij}} \right)$$

where m is the number of synonymous codon families, n_i is the number of
synonymous codons in codon family i, and fij is the frequency of codon j in
codon family i of the CDS. The exponent is simply a weighted average of
ln(w).
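
The calculation of w and CAI described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the DAMBE implementation: the function names, the two-family codon table, and the reference counts are assumptions made up for the example, and skipping codons with w = 0 is one possible convention among several.

```python
import math
from collections import Counter

# Illustrative subset of synonymous codon families (assumption: a real
# analysis would cover every multi-codon family of the genetic code).
FAMILIES = {
    "Ala": ["GCU", "GCC", "GCA", "GCG"],
    "Val": ["GUU", "GUC", "GUA", "GUG"],
}


def relative_adaptiveness(ref_counts, families=FAMILIES):
    """Return w_ij = f_ij.ref / Maxf_i.ref for every codon in each family."""
    w = {}
    for codons in families.values():
        max_freq = max(ref_counts.get(c, 0) for c in codons)
        if max_freq == 0:
            continue  # family absent from the reference set
        for c in codons:
            w[c] = ref_counts.get(c, 0) / max_freq
    return w


def count_codons(seq):
    """Count in-frame codons of a coding sequence (DNA or RNA alphabet)."""
    seq = seq.upper().replace("T", "U")
    usable = len(seq) - len(seq) % 3
    return Counter(seq[i:i + 3] for i in range(0, usable, 3))


def cai(cds_counts, w):
    """CAI = exp(sum(f_ij * ln w_ij) / sum(f_ij)).

    Codons missing from w, or with w = 0, are skipped (an assumption).
    """
    numerator = 0.0
    denominator = 0
    for codon, freq in cds_counts.items():
        weight = w.get(codon, 0.0)
        if weight > 0 and freq > 0:
            numerator += freq * math.log(weight)
            denominator += freq
    return math.exp(numerator / denominator) if denominator else float("nan")


# Made-up reference counts, for illustration only.
example_ref = {"GCU": 20, "GCC": 40, "GCA": 10, "GCG": 10,
               "GUU": 10, "GUC": 20, "GUA": 5, "GUG": 40}
example_w = relative_adaptiveness(example_ref)
print(cai(count_codons("GCCGCCGTG"), example_w))  # only major codons
print(cai(count_codons("GCTGTC"), example_w))     # only minor codons
```

A CDS built only from major codons reaches the maximum CAI of 1, while a CDS that uses minor codons scores lower in proportion to their w values.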

The maximum CAI value is 1 [23]. Relative Synonymous Codon Usage (RSCU)
measures codon usage bias within each codon family and is calculated directly
from the input sequences. RSCU is a codon-specific index of codon usage,
whereas CAI is a gene-specific index of codon usage that is related to gene
expression [23]. The general equation for RSCU is:

(4)

$$\mathrm{RSCU}_{ij} = \frac{\mathrm{CodFreq}_{ij}}{\left( \sum_{j=1}^{\mathrm{NumCodon}_i} \mathrm{CodFreq}_{ij} \right) / \mathrm{NumCodon}_i}$$

where i denotes the codon family, j denotes a specific codon within the
family, CodFreq_ij is the observed frequency of codon j in family i, and
NumCodon_i is the number of codons in family i [23]. For example, if i is the
alanine codon family (GCU, GCC, GCA, and GCG), then j is a specific codon
such as GCU. RSCU is 1 when there is no codon usage bias, greater than 1 when
the codon is overused, and less than 1 when it is underused [22].
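
As a worked check of the RSCU definition, the short Python sketch below recomputes the alanine values from the observed frequencies reported for the low-CAI genes in Table 2 (GCU 12, GCC 20, GCG 8, GCA 93). The function name and the dictionary layout are assumptions chosen for the example; they are not part of DAMBE.

```python
def rscu(counts_by_family):
    """Return RSCU for each codon: observed count / mean count in its family."""
    result = {}
    for codon_counts in counts_by_family.values():
        total = sum(codon_counts.values())
        if total == 0:
            continue  # amino acid not present in the input sequences
        mean_count = total / len(codon_counts)
        for codon, freq in codon_counts.items():
            result[codon] = freq / mean_count
    return result


# Observed alanine codon frequencies of the low-CAI genes (Table 2).
ala_low = {"Ala": {"GCU": 12, "GCC": 20, "GCG": 8, "GCA": 93}}
ala_rscu = rscu(ala_low)
for codon, value in ala_rscu.items():
    print(codon, round(value, 3))
```

The four values sum to 4, the number of codons in the family, so an RSCU above 1 for one codon is always balanced by values below 1 for others.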

Results 

The protein-coding sequences of the human herpesvirus 4 strain GD1 genome
were used to evaluate CAI with DAMBE. The CAI values of the EBV-GD1 genes
were 0.76356 ± 0.02957 (Table 1).

The highest and lowest CAI values were 0.82233 and 0.68321, respectively. For
the alanine codon family, taken as an example, the high-CAI genes showed more
codon usage bias, with the highest RSCU being 2.923 and the lowest only
0.246. In contrast, for the low-CAI genes the highest and lowest RSCU values
were 2.797 and 0.241 (Tables 2 and 3). Highly expressed genes mostly showed
more codon usage bias than lowly expressed genes (Figure 1), but ANOVA did
not show a significant difference between the RSCU_H and RSCU_L values
(P>0.05).
Table 1. Output of codon adaptation index (CAI) for EBV-GD1
(Mean: 0.76356; STD: 0.02957)

SeqName                                   SeqLen       CAI
unknown | 1736                            3954         0.76709
unknown | 9710                            510          0.76776
unknown | 36258                           1353         0.70990
unknown | [illegible]                     1008         0.76740
unknown | [illegible]                     [illegible]  [illegible]
unknown | [illegible]                     [illegible]  [illegible]
[illegible]                               9528         0.76936
unknown | C59248                          3717         0.76630
unknown | 62966                           1092         0.76266
unknown | 64136                           2478         0.79963
unknown | 66629                           906          0.80136
unknown | 67628                           1212         0.76329
unknown | 68847                           1071         0.77103
unknown | C70473                          1314         0.77520
unknown | C71899                          117          0.76729
unknown | C71987                          2622         0.76897
unknown | 74654                           654          0.76750
unknown | C75368                          834          0.79476
unknown | 76277                           306          0.73176
unknown | 76655                           486          0.71418
unknown | C77160                          2568         0.73715
unknown | C77297                          444          0.72631
unknown | 79820                           357          0.68321
EBNA3B (EBNA4A) latent protein | 82903    2814         0.72716
EBNA3C latent protein | 85921             3027         0.74198
unknown | C89046                          669          0.73457
Z protein | C89811                        735          0.73358
unknown | C90996                          1815         0.79433
unknown | 92812                           930          0.77330
unknown | 93932                           1611         0.73806
unknown | 95580                           1923         0.70636
unknown | 97588                           411          0.77062
unknown | 97983                           765          0.76167
unknown | 98764                           651          0.78868
unknown | C99460                          2427         0.78326
unknown | 103578                          834          0.82115
unknown | 106768                          1215         0.77914
unknown | [illegible]                     [illegible]  [illegible]
unknown | [illegible]                     [illegible]  [illegible]
[illegible] protein | 112569              2070         0.78928
unknown | 112569                          975          0.76398
unknown | C113494                         1008         0.77171
unknown | C114482                         1521         0.75317
unknown | C115975                         675          0.80152
unknown | C117993                         702          0.72556
unknown | C118758                         1260         0.75635
unknown | C120031                         903          0.78173
unknown | C120952                         4143         0.80319
unknown | 125621                          1725         0.78192
unknown | C128546                         2118         0.77372
unknown | C130668                         1821         0.75959
unknown | 132490                          744          0.71112
unknown | 133046                          1710         0.76698
unknown | 13555[illegible]                1815         0.78338
unknown | C137409                         744          0.75202
unknown | C140486                         2772         0.74451
unknown | C149527                         708          0.71853
unknown | C150198                         1407         0.74320
unknown | 150236                          309          0.72173
unknown | C151616                         936          0.78017
unknown | C153152                         3045         0.82233
unknown | C156202                         2571         0.79823
unknown | C160837                         3384         0.80318
unknown | C164308                         660          0.81822
unknown | 164957                          663          0.79950
unknown | C166757                         180          0.72802


Table 2. RSCU of genes with low CAI values (RSCU_L) for EBV-GD1

Codon  AA  ObsFreq  RSCU_L    Codon  AA  ObsFreq  RSCU_L
UAG    *   0        0.000     UGA    *   1        1.000
GCU    A   12       0.361     UAA    *   2        2.000
GCC    A   20       0.602     GCG    A   8        0.241
UGU    C   3        0.750     GCA    A   93       2.797
GAU    D   34       1.172     UGC    C   5        1.250
GAG    E   21       0.792     GAC    D   24       0.828
UUU    F   13       1.444     GAA    E   32       1.208
GGU    G   29       0.410     UUC    F   5        0.556
GGC    G   29       0.410     GGG    G   76       1.074
CAC    H   5        0.455     GGA    G   149      2.106
AUU    I   17       1.821     CAU    H   17       1.545
AUC    I   6        0.643     AUA    I   5        0.536
AAG    K   17       1.478     AAA    K   6        0.522
CUC    L   14       1.167     CUA    L   14       1.167
CUU    L   13       1.083     CUG    L   7        0.583
UUG    L   8        1.000     UUA    L   8        1.000
AAC    N   12       1.143     AUG    M   20       1.000
CCA    P   68       1.744     AAU    N   9        0.857
CCU    P   40       1.026     CCC    P   33       0.846
CAA    Q   28       1.217     CCG    P   15       0.385
AGA    R   19       0.844     CAG    Q   18       0.783
CGA    R   10       1.000     AGG    R   26       1.156
CGG    R   12       1.200     CGC    R   9        0.900
AGC    S   11       0.759     CGU    R   9        0.900
UCA    S   24       2.043     AGU    S   18       1.241
UCG    S   4        0.340     UCC    S   11       0.936
ACC    T   14       1.167     UCU    S   8        0.681
ACG    T   7        0.583     ACA    T   17       1.417
GUU    V   10       0.930     ACU    T   10       0.833
GUC    V   12       1.116     GUG    V   11       1.023
UGG    W   12       1.000     GUA    V   10       0.930
UAU    Y   10       1.429     UAC    Y   4        0.571

ObsFreq: observed frequency; AA: amino acid; RSCU_L: relative synonymous codon usage of genes with low CAI values.
Table 3. RSCU of genes with high CAI values (RSCU_H) for EBV-GD1

Codon  AA  ObsFreq  RSCU_H    Codon  AA  ObsFreq  RSCU_H
UAG    *   1        1.000     UGA    *   0        0.000
GCU    A   8        0.246     UAA    *   2        2.000
GCC    A   95       2.923     GCG    A   18       0.554
UGU    C   8        0.390     GCA    A   9        0.277
GAU    D   22       0.518     UGC    C   33       1.610
GAG    E   70       1.750     GAC    D   63       1.482
UUU    F   33       0.971     GAA    E   10       0.250
GGU    G   3        0.132     UUC    F   35       1.029
GGC    G   41       1.809     GGG    G   39       1.714
CAC    H   31       1.442     GGA    G   8        0.352
AUU    I   15       0.703     CAU    H   12       0.558
AUC    I   40       1.875     AUA    I   9        0.422
AAG    K   43       1.593     AAA    K   11       0.407
CUC    L   58       1.415     CUA    L   9        0.220
CUU    L   4        0.098     CUG    L   93       2.268
UUG    L   10       1.818     UUA    L   1        0.182
AAC    N   37       1.609     AUG    M   29       1.000
CCA    P   15       0.779     AAU    N   9        0.391
CCU    P   14       0.727     CCC    P   36       1.870
CAA    Q   10       0.417     CCG    P   12       0.623
AGA    R   9        0.529     CAG    Q   38       1.583
CGA    R   4        0.219     AGG    R   25       1.471
CGG    R   29       1.589     CGC    R   34       1.863
AGC    S   27       1.636     CGU    R   6        0.329
UCA    S   7        0.444     AGU    S   6        0.364
UCG    S   16       1.016     UCC    S   31       1.968
ACC    T   35       1.892     UCU    S   9        0.571
ACG    T   25       1.351     ACA    T   12       0.649
GUU    V   4        0.143     ACU    T   2        0.108
GUC    V   37       1.321     GUG    V   66       2.357
UGG    W   15       1.000     GUA    V   5        0.179
UAU    Y   11       0.379     UAC    Y   47       1.621

ObsFreq: observed frequency; AA: amino acid; RSCU_H: relative synonymous codon usage of genes with high CAI values.



Figure 1. Relative synonymous codon usage (RSCU) for high-CAI and low-CAI genes (RSCU_H and
RSCU_L, respectively) for the 64 codons of EBV-GD1.
Discussion

One of the fundamental questions in molecular biology concerns the genetic
code. In microorganisms, the most widely accepted hypothesis is that the
unequal usage of synonymous codons results from both mutation pressure and
natural selection, and that it can affect the level of translation.

The CAI uses the highly expressed genes of a species to evaluate the relative
merits of each codon. It is also used as an indicator of gene expression and
translation efficiency [23]. The translation efficiency of an mRNA depends
partly on its coding strategy, which is reflected in codon usage bias. Codon
usage bias is usually measured with codon-specific as well as gene-specific
indices.

A representative codon-specific index is the relative synonymous codon usage
(RSCU) [24], and a representative gene-specific index is the codon adaptation
index (CAI). CAI is an index of translation elongation rate based on the
codon usage of highly expressed genes [25]. Put differently, highly expressed
genes are under pressure to use abundant, common, or cheap amino acids;
conversely, a protein whose amino acid components are rare or expensive
cannot be produced in large quantities. According to previous data, highly
expressed genes preferentially use the codons recognized by the most abundant
tRNA for each amino acid. For this reason, highly biased codon usage is found
in highly expressed genes, especially in rapidly replicating organisms
[23-28].

By identifying the highly and lowly expressed genes of an organism, we might
be able to select them as main targets in pharmacology, especially in vaccine
production. CAI is calculated with a reference set of highly expressed genes.
The maximum CAI is 1 and the minimum is 0. In general, the higher the CAI
value, the more efficiently the mRNA is translated. Given the human reference
set of highly expressed genes, highly expressed human genes typically have
CAI values above 0.7 [23]. The CAI values of the EBV-GD1 genes were
0.76356 ± 0.02957. Our result agrees with Knipe et al. (2001) that EBV is an
extremely efficient virus, which infects a large majority of the adult
population and, following primary infection, remains in the infected host as
a lifelong asymptomatic infection [26]. Xia (2007) determined that viruses
that cause acute disease need to translate their mRNAs efficiently [27].

Figure 1 plots the RSCU of the high-CAI genes (RSCU_H) and low-CAI genes
(RSCU_L) for the 64 codons. It shows that the high-CAI genes (representing
highly expressed genes) have RSCU values deviating much more from 1 than the
low-CAI genes (representing lowly expressed genes). Highly expressed genes
mostly showed more codon usage bias than lowly expressed genes (Figure 1),
but ANOVA did not show a significant difference (P>0.05). This might be
related to the fact that EBV has two different forms of existence: latent and
productive. The EBV genes expressed during latency show codon usage very
different from that of the genes expressed during lytic growth [29]. What,
for example, can we say about the tRNA carrying alanine? From the results,
GCC is the most frequently used codon, and we might predict that
tRNA Ala/AGG is the most abundant. How could this prediction be tested?
Unfortunately, this is an extremely difficult experiment. Nevertheless, all
these data could be used to highlight the genes with high expression rates
and their importance in EBV-GD1, and thereby provide a basis for
understanding the pathogenesis of EBV-GD1 and open a window toward producing
a better product or vaccine against the virus.
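
As a rough illustration of the comparison between high-CAI and low-CAI genes, the Python sketch below computes the mean absolute deviation of RSCU from 1 for the valine codon family, using the values listed in Tables 2 and 3. The function name is hypothetical and the choice of this deviation measure is an assumption; a single codon family only illustrates the idea and does not reproduce Figure 1 or the ANOVA.

```python
def mean_abs_deviation_from_one(rscu_values):
    """Average distance of RSCU values from 1 (the no-bias expectation)."""
    values = list(rscu_values)
    return sum(abs(v - 1.0) for v in values) / len(values)


# Valine codon family, RSCU values as listed in Table 3 (high-CAI genes)
# and Table 2 (low-CAI genes).
val_rscu_high = {"GUU": 0.143, "GUC": 1.321, "GUG": 2.357, "GUA": 0.179}
val_rscu_low = {"GUU": 0.930, "GUC": 1.116, "GUG": 1.023, "GUA": 0.930}

dev_high = mean_abs_deviation_from_one(val_rscu_high.values())
dev_low = mean_abs_deviation_from_one(val_rscu_low.values())
print(f"high-CAI genes: {dev_high:.3f}")
print(f"low-CAI genes:  {dev_low:.3f}")
```

For this family the high-CAI values lie much further from 1 than the low-CAI values, which is the pattern the text attributes to highly expressed genes.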

Conclusion 

The results might provide a framework and its principles for understanding
the pathogenesis and evolution of EBV-GD1, and open a window toward making a
better product or vaccine against the virus. Based on the results, we could
identify which genes or sequences are highly expressed, or are under strong
natural selection to optimize their codon usage so as to maximize translation
efficiency and accuracy. Put differently, selection should be weak for lowly
expressed genes, whose codon usage might depend largely on mutation bias
[27].